Abstract



We present a stochastic setting for optimization problems with nonsmooth convex separable objective
functions over linear equality constraints. To solve such problems, we propose a stochastic
Alternating Direction Method of Multipliers (ADMM) algorithm. Our algorithm applies to a more
general class of nonsmooth convex functions, including those for which directly minimizing the
augmented function does not admit a closed-form solution. We also establish the rates of convergence of
our algorithm under various structural assumptions on the stochastic functions: $O(1/\sqrt{t})$ for convex
functions and $O(\log t / t)$ for strongly convex functions. In contrast to the previous literature, we
establish the convergence rate of the ADMM algorithm, for the first time, in terms of both the objective value
and the feasibility violation.



1 Introduction 



The Alternating Direction Method of Multipliers (ADMM) [1, 2] is a very simple computational method for
constrained optimization proposed in the 1970s. The theoretical aspects of ADMM were studied from the 1980s to the 1990s, and
its global convergence was established in the literature [3, 4, 5]. As reviewed in the comprehensive paper [6], with its
capacity for dealing with objective functions separately and synchronously, this method turned out to be a natural fit
for large-scale data-distributed machine learning and big-data related optimization, and it has therefore received
a significant amount of attention in the last few years. Intensive theoretical and practical advances have followed.
On the theoretical side, ADMM was recently shown to have a rate of convergence of $O(1/N)$ [7, 8, 9, 10], where
$N$ stands for the number of iterations. On the practical side, ADMM has been applied to a wide range of application
domains, such as compressed sensing [11], image restoration [12], video processing and matrix completion [13].
Besides that, many variations of this classical method have been recently developed, such as linearized [13, 14, 15],
accelerated [13], and online [10] ADMM. However, most of these variants, including the classic one, implicitly assume
full accessibility to true data values, while in reality one can hardly ignore the existence of noise. A more natural way
of handling this issue is to consider unbiased or even biased observations of true data, which leads us to the stochastic
setting.



'A short version appears in the 5th NIPS Workshop on Optimization for Machine Learning, Lake Tahoe, Nevada, USA, 2012. 
1.1 Stochastic Setting for ADMM



In this work, we study a family of convex optimization problems in which our objective functions are separable and
stochastic. In particular, we are interested in solving the following linear equality-constrained stochastic optimization:

$$\min_{x \in \mathcal{X},\, y \in \mathcal{Y}} \ \mathbb{E}_{\xi}\, \theta_1(x, \xi) + \theta_2(y) \quad \text{s.t.} \quad Ax + By = b, \tag{1}$$

where $x \in \mathbb{R}^{d_1}$, $y \in \mathbb{R}^{d_2}$, $A \in \mathbb{R}^{m \times d_1}$, $B \in \mathbb{R}^{m \times d_2}$, $b \in \mathbb{R}^{m}$, $\mathcal{X}$ is a convex compact set, and $\mathcal{Y}$ is a closed convex
set. We are able to draw a sequence of independent and identically distributed (i.i.d.) observations from the random vector $\xi$ that
obeys a fixed but unknown distribution $P$. One can see that when $\xi$ is deterministic, we recover the traditional
problem setting for ADMM [6]. Denote the expectation function $\theta_1(x) \equiv \mathbb{E}_{\xi}\, \theta_1(x, \xi)$. In our most general setting,
the real-valued functions $\theta_1(\cdot)$ and $\theta_2(\cdot)$ are convex but not necessarily continuously differentiable.

Note that our stochastic setting of the problem is quite different from that of the Online ADMM proposed
in [10]. In Online ADMM, one does not assume $\xi$ to be i.i.d., nor the objective to be stochastic; instead,
a deterministic concept referred to as regret is of concern:

$$R(x_{[1:t]}) \equiv \sum_{k=1}^{t} \left[ \theta_1(x_k, \xi_k) + \theta_2(y_k) \right] - \inf_{Ax + By = b} \sum_{k=1}^{t} \left[ \theta_1(x, \xi_k) + \theta_2(y) \right].$$



1.2 Our Contributions 



In this work, we propose a stochastic setting of the ADMM problem and design the Stochastic ADMM algorithm.
A key algorithmic feature of our Stochastic ADMM that distinguishes it from previous ADMM and its variants is the
first-order approximation of $\theta_1$ that we use to modify the augmented Lagrangian. This simple modification not only
guarantees the convergence of our stochastic method, but also extends it to a more general class of convex objective
functions for which directly minimizing the augmented $\theta_1$ might not have a closed-form solution. For example, with
stochastic ADMM, we can derive closed-form updates for the nonsmooth hinge loss function (used in support vector
machines). With deterministic ADMM, however, one has to call an SVM solver during each iteration [6], which is
very time-consuming. One of our main contributions is that we develop the convergence rates of our algorithm
under various structural assumptions. For convex $\theta_1(\cdot)$, the rate is proved to be $O(1/\sqrt{t})$; for strongly convex $\theta_1(\cdot)$,
the rate is proved to be $O(\log t / t)$. To the best of our knowledge, this is the first time that convergence rates of ADMM
are established for both the objective value and the feasibility violation. By contrast, recent research [8, 10] only shows
the convergence of ADMM indirectly, in terms of the satisfaction of variational inequalities.



1.3 Notations 



Throughout this paper, we denote the subgradients of $\theta_1$ and $\theta_2$ as $\theta_1'$ and $\theta_2'$. When they are differentiable, we will
use $\nabla\theta_1$ and $\nabla\theta_2$ to denote the gradients. We use the notation $\theta_1$ both for the instance function value $\theta_1(x, \xi)$ and for
its expectation $\theta_1(x)$. We denote by $\theta(u) \equiv \theta_1(x) + \theta_2(y)$ the sum of the stochastic and the deterministic functions.
For simplicity and clarity, we will use the following notations to denote stacked vectors or tuples:



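$$u \equiv \begin{pmatrix} x \\ y \end{pmatrix}, \quad
w \equiv \begin{pmatrix} x \\ y \\ \lambda \end{pmatrix}, \quad
w_k \equiv \begin{pmatrix} x_k \\ y_k \\ \lambda_k \end{pmatrix}, \quad
\mathcal{W} \equiv \mathcal{X} \times \mathcal{Y} \times \mathbb{R}^m,$$

$$\bar{u}_k \equiv \begin{pmatrix} \frac{1}{k}\sum_{i=0}^{k-1} x_i \\ \frac{1}{k}\sum_{i=1}^{k} y_i \end{pmatrix}, \quad
\bar{w}_k \equiv \begin{pmatrix} \frac{1}{k}\sum_{i=1}^{k} x_i \\ \frac{1}{k}\sum_{i=1}^{k} y_i \\ \frac{1}{k}\sum_{i=1}^{k} \lambda_i \end{pmatrix}, \quad
F(w) \equiv \begin{pmatrix} -A^T\lambda \\ -B^T\lambda \\ Ax + By - b \end{pmatrix}. \tag{2}$$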
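The convergence proof later relies on the monotonicity of the operator $F$. As a minimal numerical sketch, assuming $F(w) = (-A^T\lambda,\ -B^T\lambda,\ Ax + By - b)$ as used in the proof of Theorem 1, the snippet below checks that $(w_1 - w_2)^T (F(w_1) - F(w_2))$ vanishes for random inputs, since $F$ is affine with a skew-symmetric linear part. The dimensions and random data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def F(x, y, lam, A, B, b):
    """Stacked operator F(w) = (-A^T lam, -B^T lam, A x + B y - b)."""
    return np.concatenate([-A.T @ lam, -B.T @ lam, A @ x + B @ y - b])

rng = np.random.default_rng(0)
m, d1, d2 = 3, 4, 5  # illustrative sizes (assumption)
A = rng.normal(size=(m, d1))
B = rng.normal(size=(m, d2))
b = rng.normal(size=m)

def random_w():
    """Draw a random triple (x, y, lambda)."""
    return rng.normal(size=d1), rng.normal(size=d2), rng.normal(size=m)

w1, w2 = random_w(), random_w()
diff_w = np.concatenate(w1) - np.concatenate(w2)
diff_F = F(*w1, A, B, b) - F(*w2, A, B, b)

# Zero up to rounding: the linear part of F is skew-symmetric, so F is monotone.
gap = float(diff_w @ diff_F)
```

A value of `gap` at rounding level is consistent with monotonicity holding with equality for this operator.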



For a positive semidefinite matrix $G \in \mathbb{R}^{d_1 \times d_1}$, we define the $G$-norm of a vector $x$ as $\|x\|_G := \|G^{1/2}x\|_2 = \sqrt{x^T G x}$.
We use $\langle \cdot, \cdot \rangle$ to denote the inner product in a finite dimensional Euclidean space. When there is no ambiguity, we often
use $\|\cdot\|$ to denote the Euclidean norm $\|\cdot\|_2$. We assume that the optimal solution of (1) exists and denote it as
$u_* \equiv (x_*^T, y_*^T)^T$. The following quantities appear frequently in our convergence analysis:

$$\delta_k \equiv \theta_1'(x_{k-1}, \xi_k) - \theta_1'(x_{k-1}), \quad
D_{\mathcal{X}} \equiv \sup_{x_a, x_b \in \mathcal{X}} \|x_a - x_b\|, \quad
D_{y_*, B} \equiv \|B(y_0 - y_*)\|. \tag{3}$$
1.4 Assumptions

Before presenting the algorithm and convergence results, we list the assumptions that will be used in our statements. 



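Assumption 1. For all $x \in \mathcal{X}$, $\mathbb{E}\left[\|\theta_1'(x, \xi)\|^2\right] \le M^2$.

Assumption 2. For all $x \in \mathcal{X}$, $\mathbb{E}\left[\exp\{\|\theta_1'(x, \xi)\|^2 / M^2\}\right] \le \exp\{1\}$.

Assumption 3. For all $x \in \mathcal{X}$, $\mathbb{E}\left[\|\nabla\theta_1(x, \xi) - \nabla\theta_1(x)\|^2\right] \le \sigma^2$.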

2 Stochastic ADMM Algorithm 

Directly solving problem (1) can be nontrivial even if $\xi$ is deterministic and the equality constraint is as simple as
$x - y = 0$. For example, using the augmented Lagrangian method, one has to minimize the augmented Lagrangian:

$$\min_{x \in \mathcal{X},\, y \in \mathcal{Y}} \mathcal{L}_\beta(x, y, \lambda) \equiv \min_{x \in \mathcal{X},\, y \in \mathcal{Y}} \ \theta_1(x) + \theta_2(y) - \langle \lambda, Ax + By - b \rangle + \frac{\beta}{2}\|Ax + By - b\|^2,$$

where $\beta$ is a pre-defined penalty parameter. This problem is at least as hard as the original one. The
(deterministic) ADMM (Alg. 1) solves this problem in a Gauss-Seidel manner: it minimizes $\mathcal{L}_\beta$ with respect to $x$ and $y$ alternately,
keeping the other fixed, followed by a penalty update of the Lagrangian multiplier $\lambda$.

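Algorithm 1 Deterministic ADMM

0. Initialize $y_0$ and $\lambda_0 = 0$.
for $k = 0, 1, 2, \ldots$ do
1. $x_{k+1} \leftarrow \arg\min_{x \in \mathcal{X}} \left\{ \theta_1(x) + \frac{\beta}{2} \left\| (Ax + By_k - b) - \frac{\lambda_k}{\beta} \right\|^2 \right\}$.
2. $y_{k+1} \leftarrow \arg\min_{y \in \mathcal{Y}} \left\{ \theta_2(y) + \frac{\beta}{2} \left\| (Ax_{k+1} + By - b) - \frac{\lambda_k}{\beta} \right\|^2 \right\}$.
3. $\lambda_{k+1} \leftarrow \lambda_k - \beta (Ax_{k+1} + By_{k+1} - b)$.
end for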
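To make the three lines of deterministic ADMM concrete, here is a minimal sketch on a toy instance chosen for illustration (it is not an example from the paper): $\theta_1(x) = \frac{1}{2}\|x - c\|^2$, $\theta_2(y) = \nu\|y\|_1$, with constraint $x - y = 0$, that is $A = I$, $B = -I$, $b = 0$, and $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$. Both subproblems then have closed forms, and the known solution is the soft-thresholding of $c$ at level $\nu$. The names `admm_l1_denoise`, `c`, and `nu` are hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def admm_l1_denoise(c, nu, beta=1.0, iters=200):
    """Deterministic ADMM for min 0.5*||x - c||^2 + nu*||y||_1  s.t.  x - y = 0."""
    y = np.zeros_like(c)
    lam = np.zeros_like(c)
    for _ in range(iters):
        # Line 1: minimize 0.5*||x - c||^2 + (beta/2)*||x - y - lam/beta||^2 over x.
        x = (c + beta * y + lam) / (1.0 + beta)
        # Line 2: minimize nu*||y||_1 + (beta/2)*||x - y - lam/beta||^2 over y.
        y = soft_threshold(x - lam / beta, nu / beta)
        # Line 3: multiplier update with the sign convention of Algorithm 1.
        lam = lam - beta * (x - y)
    return x, y, lam

c = np.array([2.0, 0.3, -1.0])
x_hat, y_hat, lam_hat = admm_l1_denoise(c, nu=0.5)
expected = soft_threshold(c, 0.5)
```

On this instance the iterates can be compared against `expected`, which serves as a sanity check of the update order and of the sign convention for $\lambda$.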



A variant of the deterministic algorithm, named linearized ADMM, replaces Line 1 of Alg. 1 by

$$x_{k+1} \leftarrow \arg\min_{x \in \mathcal{X}} \left\{ \theta_1(x) + \frac{\beta}{2} \left\| (Ax + By_k - b) - \frac{\lambda_k}{\beta} \right\|^2 + \frac{1}{2}\|x - x_k\|_G^2 \right\}, \tag{4}$$

where $G \in \mathbb{R}^{d_1 \times d_1}$ is positive semidefinite. This variant can be regarded as a generalization of the original ADMM.
When $G = 0$, it is the same as Alg. 1. When $G = rI_{d_1} - \beta A^T A$, it is equivalent to the following linearized proximal
point method:

$$x_{k+1} \leftarrow \arg\min_{x \in \mathcal{X}} \left\{ \theta_1(x) + \beta (x - x_k)^T \left[ A^T \left( Ax_k + By_k - b - \frac{\lambda_k}{\beta} \right) \right] + \frac{r}{2}\|x - x_k\|^2 \right\}.$$

Note that the linearization is applied only to the quadratic function $\|(Ax + By_k - b) - \lambda_k/\beta\|^2$, but not to $\theta_1$. This
approximation helps when Line 1 of Alg. 1 does not produce a closed-form solution given the quadratic term. This is the case,
for example, when $\theta_1(x) = \|x\|_1$ and $A$ is not the identity.
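The $\ell_1$ case mentioned above can be sketched in a few lines. Assuming $\theta_1(x) = \|x\|_1$, $\mathcal{X} = \mathbb{R}^{d_1}$, and $G = rI - \beta A^T A$ with $r \ge \beta\|A\|_2^2$ (so that $G$ is positive semidefinite), the linearized update reduces to a soft-thresholding step. The helper names and the random test data below are hypothetical; the objective function is included only to check that the closed form indeed minimizes (4).

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def linearized_x_update(x_k, y_k, lam_k, A, B, b, beta, r):
    """Closed-form linearized ADMM x-update for theta_1 = ||.||_1."""
    g = A.T @ (A @ x_k + B @ y_k - b - lam_k / beta)
    return soft_threshold(x_k - (beta / r) * g, 1.0 / r)

def objective_4(x, x_k, y_k, lam_k, A, B, b, beta, r):
    """Objective of (4) with G = r*I - beta*A^T A and theta_1 = ||.||_1."""
    G = r * np.eye(A.shape[1]) - beta * A.T @ A
    res = A @ x + B @ y_k - b - lam_k / beta
    d = x - x_k
    return np.abs(x).sum() + 0.5 * beta * (res @ res) + 0.5 * (d @ G @ d)

rng = np.random.default_rng(0)
A, B, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 2)), rng.normal(size=3)
x_k, y_k, lam_k = rng.normal(size=4), rng.normal(size=2), rng.normal(size=3)
beta = 1.0
r = beta * np.linalg.norm(A, 2) ** 2 + 1.0  # makes G positive definite

x_new = linearized_x_update(x_k, y_k, lam_k, A, B, b, beta, r)
f_new = objective_4(x_new, x_k, y_k, lam_k, A, B, b, beta, r)
```

Without the term $\frac{1}{2}\|x - x_k\|_G^2$, the same subproblem would couple the coordinates through $A^T A$ and would generally need an inner iterative solver.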

As shown in Alg. 2, we propose a Stochastic Alternating Direction Method of Multipliers (Stochastic ADMM)
algorithm. Our algorithm shares some features with the classical and the linearized ADMM. One can see that Lines 2 and
3 are essentially the same as before. However, there are two major differences in Line 1. First, we replace $\theta_1(x)$
with a first-order approximation of $\theta_1(x, \xi_{k+1})$ at $x_k$: $\theta_1(x_k) + x^T \theta_1'(x_k, \xi_{k+1})$. This approximation has the same
flavour as the stochastic mirror descent [16] used for solving a one-variable stochastic convex problem. One important
benefit of using this approximation is that our algorithm can be applied to nonsmooth objective functions, beyond the
smooth and separable least squares loss used in lasso. Second, similar to the linearized ADMM (4), we add an $\ell_2$-norm
prox-function $\|x - x_k\|^2$ but scale it by a time-varying stepsize $\eta_{k+1}$. As we will see in Section 3, the choice of this
stepsize is crucial in guaranteeing convergence.
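Algorithm 2 Stochastic ADMM

0. Initialize $x_0$, $y_0$ and $\lambda_0 = 0$.
for $k = 0, 1, 2, \ldots$ do
1. $x_{k+1} \leftarrow \arg\min_{x \in \mathcal{X}} \left\{ \langle \theta_1'(x_k, \xi_{k+1}), x \rangle + \frac{\beta}{2} \left\| (Ax + By_k - b) - \frac{\lambda_k}{\beta} \right\|^2 + \frac{\|x - x_k\|^2}{2\eta_{k+1}} \right\}$.
2. $y_{k+1} \leftarrow \arg\min_{y \in \mathcal{Y}} \left\{ \theta_2(y) + \frac{\beta}{2} \left\| (Ax_{k+1} + By - b) - \frac{\lambda_k}{\beta} \right\|^2 \right\}$.
3. $\lambda_{k+1} \leftarrow \lambda_k - \beta (Ax_{k+1} + By_{k+1} - b)$.
end for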
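The following is a minimal sketch of Stochastic ADMM on an illustrative instance, not the authors' implementation. It assumes the hinge loss $\theta_1(x, \xi) = \max(0, 1 - l\, a^T x)$ with $\xi = (a, l)$, the regularizer $\theta_2(y) = \nu\|y\|_1$, the constraint $x - y = 0$ ($A = I$, $B = -I$, $b = 0$), $\mathcal{X}$ a Euclidean ball of radius `radius`, and $\mathcal{Y} = \mathbb{R}^d$. With $A = I$, the objective of Line 1 is an isotropic quadratic, so its minimizer over the ball is the projection of the unconstrained minimizer. The step size follows the schedule $\eta_k = D_{\mathcal{X}} / (M\sqrt{2k})$, with $M$ taken as the largest feature norm. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def project_ball(z, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def hinge_subgradient(x, a, label):
    """A subgradient of max(0, 1 - label * a^T x) with respect to x."""
    return -label * a if label * (a @ x) < 1.0 else np.zeros_like(x)

def stochastic_admm_step(x_k, y_k, lam_k, g, beta, eta, nu, radius):
    """One pass through Lines 1-3 of Algorithm 2 for A = I, B = -I, b = 0."""
    # Line 1: minimize <g, x> + (beta/2)||x - y_k - lam_k/beta||^2 + ||x - x_k||^2/(2 eta).
    z = (beta * y_k + lam_k + x_k / eta - g) / (beta + 1.0 / eta)
    x_next = project_ball(z, radius)
    # Line 2: minimize nu*||y||_1 + (beta/2)||x_next - y - lam_k/beta||^2.
    y_next = soft_threshold(x_next - lam_k / beta, nu / beta)
    # Line 3: multiplier update.
    lam_next = lam_k - beta * (x_next - y_next)
    return x_next, y_next, lam_next

def run(features, labels, beta=1.0, nu=0.01, radius=5.0, iters=500, seed=0):
    """Run Stochastic ADMM and return the averaged iterates (x_bar, y_bar)."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    M = np.linalg.norm(features, axis=1).max()  # bound on the subgradient norms
    D_X = 2.0 * radius                          # diameter of the ball
    x, y, lam = np.zeros(d), np.zeros(d), np.zeros(d)
    x_sum, y_sum = np.zeros(d), np.zeros(d)
    for k in range(iters):
        i = rng.integers(n)                     # draw one sample xi_{k+1}
        g = hinge_subgradient(x, features[i], labels[i])
        eta = D_X / (M * np.sqrt(2.0 * (k + 1)))
        x_sum += x                              # accumulates x_0, ..., x_{t-1}
        x, y, lam = stochastic_admm_step(x, y, lam, g, beta, eta, nu, radius)
        y_sum += y                              # accumulates y_1, ..., y_t
    return x_sum / iters, y_sum / iters

data_rng = np.random.default_rng(0)
features = data_rng.normal(size=(200, 5))
labels = np.sign(features @ np.ones(5))
x_bar, y_bar = run(features, labels)
```

Each iteration costs one subgradient evaluation plus a few vector operations, which is the closed-form behaviour for the hinge loss that the text contrasts with calling an SVM solver per iteration.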



3 Main Results of Convergence Rates 

In this section, we will show that our Stochastic ADMM given in Alg. 2 exhibits a rate of convergence of $O(1/\sqrt{t})$ in
terms of both the objective value and the feasibility violation: $\mathbb{E}[\theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2] = O(1/\sqrt{t})$.
We extend the main result when more structural information about $\theta_1$ is available.

Before we address the main theorem on convergence rates, we first present an upper bound on the variation of the
Lagrangian function and its first-order approximation at each iterate.

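Lemma 1. $\forall w \in \mathcal{W}$, $k \ge 1$, we have

$$\begin{aligned}
\theta_1(x_k) + \theta_2(y_{k+1}) - \theta(u) + (w_{k+1} - w)^T F(w_{k+1})
\le{}& \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2}
+ \frac{1}{2\eta_{k+1}} \left( \|x_k - x\|^2 - \|x_{k+1} - x\|^2 \right) \\
&+ \frac{\beta}{2} \left( \|Ax + By_k - b\|^2 - \|Ax + By_{k+1} - b\|^2 \right) \\
&+ \langle \delta_{k+1}, x - x_k \rangle + \frac{1}{2\beta} \left( \|\lambda - \lambda_k\|^2 - \|\lambda - \lambda_{k+1}\|^2 \right).
\end{aligned} \tag{5}$$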

Using this lemma, we obtain our main result. We present the main convergence theorem in two forms:
a bound in expectation and a bound that holds with high probability.

Theorem 1. Let $\eta_k = \frac{D_{\mathcal{X}}}{M\sqrt{2k}}$, $\forall k \ge 1$, and $\rho > 0$.

(i) Under Assumption 1, we have $\forall t \ge 1$,

$$\mathbb{E}\left[\theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2\right] \le M_1(t) + M_2(t) \equiv \frac{\sqrt{2} D_{\mathcal{X}} M}{\sqrt{t}} + \frac{\beta D_{y_*,B}^2 + \rho^2/\beta}{2t}. \tag{6}$$

(ii) Under Assumptions 1 and 2, we have for any $\Omega > 0$,

$$\mathrm{Prob}\left\{ \theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2 > \left(1 + \frac{\Omega}{2} + 2\sqrt{2\Omega}\right) M_1(t) + M_2(t) \right\} \le 2\exp\{-\Omega\}. \tag{7}$$
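To get a feel for the constants in Theorem 1, the sketch below evaluates the step size $\eta_k = D_{\mathcal{X}}/(M\sqrt{2k})$, the two terms $M_1(t) = \sqrt{2} D_{\mathcal{X}} M/\sqrt{t}$ and $M_2(t) = (\beta D_{y_*,B}^2 + \rho^2/\beta)/(2t)$, and the high-probability threshold of part (ii). The function names and the numerical inputs are hypothetical and only illustrate how the quantities scale.

```python
import math

def step_size(k, D_X, M):
    """Step size eta_k = D_X / (M * sqrt(2k)) for k >= 1."""
    return D_X / (M * math.sqrt(2.0 * k))

def bound_terms(t, D_X, M, D_yB, beta, rho):
    """Return (M_1(t), M_2(t)) from the expectation bound."""
    M1 = math.sqrt(2.0) * D_X * M / math.sqrt(t)
    M2 = (beta * D_yB ** 2 + rho ** 2 / beta) / (2.0 * t)
    return M1, M2

def high_probability_bound(t, D_X, M, D_yB, beta, rho, omega):
    """Return (threshold, failure probability) from the high-probability bound."""
    M1, M2 = bound_terms(t, D_X, M, D_yB, beta, rho)
    threshold = (1.0 + omega / 2.0 + 2.0 * math.sqrt(2.0 * omega)) * M1 + M2
    return threshold, 2.0 * math.exp(-omega)

M1, M2 = bound_terms(t=4, D_X=1.0, M=1.0, D_yB=1.0, beta=1.0, rho=1.0)
threshold, failure_prob = high_probability_bound(4, 1.0, 1.0, 1.0, 1.0, 1.0, omega=3.0)
```

Since $M_2(t)$ depends on $\beta$ only through $\beta D_{y_*,B}^2 + \rho^2/\beta$, this term is smallest at $\beta = \rho / D_{y_*,B}$, although $D_{y_*,B}$ is typically unknown in practice.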

Remark 1. Adapting our proof techniques to the deterministic case where no noise takes place, we are able to obtain
a similar result for deterministic ADMM:

$$\forall \rho > 0,\ t \ge 1, \quad \theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2 \le \frac{\beta D_{y_*,B}^2}{2t} + \frac{\rho^2}{2\beta t}. \tag{8}$$

While this yields the same $O(1/t)$ convergence rate as the existing literature [8, 9, 10], the above finding is
a significant advance in the theoretical aspects of ADMM. For the first time, the convergence of ADMM is proved
in terms of objective value and feasibility violation. By contrast, the existing literature [8, 9, 10] only shows the
convergence of ADMM in terms of the satisfaction of variational inequalities, which is not a direct measure of how
fast an algorithm reaches the optimal solution.
3.1 Extension: Strongly Convex $\theta_1$

When the function $\theta_1(\cdot)$ is strongly convex, the convergence rate of Stochastic ADMM can be improved to $O\left(\frac{\log t}{t}\right)$.

Theorem 2. When $\theta_1$ is $\mu$-strongly convex with respect to $\|\cdot\|$, taking $\eta_k = \frac{1}{k\mu}$ in Alg. 2, under Assumption 1,
$\forall \rho > 0$, $t \ge 1$ we have

$$\mathbb{E}\left[\theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2\right] \le \frac{M^2 \log t}{\mu t} + \frac{\mu D_{\mathcal{X}}^2}{2t} + \frac{\beta D_{y_*,B}^2}{2t} + \frac{\rho^2}{2\beta t}.$$

3.2 Extension: Lipschitz Smooth $\theta_1$



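Since the bounds given in Theorem 1 are related to the magnitude of subgradients, they do not provide any intuition
about the performance in low-noise scenarios. With a Lipschitz smooth function $\theta_1$, we are able to obtain convergence
rates in terms of the variations of gradients, as stated in Assumption 3. Besides, under this assumption we are able to
replace the unusual definition of $\bar{u}_k$ in (2) with

$$\bar{u}_k \equiv \begin{pmatrix} \frac{1}{k}\sum_{i=1}^{k} x_i \\ \frac{1}{k}\sum_{i=1}^{k} y_i \end{pmatrix}. \tag{9}$$

Theorem 3. When $\theta_1(\cdot)$ is $L$-Lipschitz smooth with respect to $\|\cdot\|$, taking $\eta_k = \frac{1}{L + \sigma\sqrt{2k}/D_{\mathcal{X}}}$ in Alg. 2, under
Assumption 3, $\forall \rho > 0$, $t \ge 1$ we have

$$\mathbb{E}\left[\theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2\right] \le \frac{\sqrt{2} D_{\mathcal{X}} \sigma}{\sqrt{t}} + \frac{L D_{\mathcal{X}}^2}{2t} + \frac{\beta D_{y_*,B}^2}{2t} + \frac{\rho^2}{2\beta t}.$$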
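The low-noise behaviour that motivates this extension can be illustrated numerically. The sketch below assumes the step size $\eta_k = 1/(L + \sigma\sqrt{2k}/D_{\mathcal{X}})$ and the bound $\sqrt{2} D_{\mathcal{X}}\sigma/\sqrt{t} + L D_{\mathcal{X}}^2/(2t) + \beta D_{y_*,B}^2/(2t) + \rho^2/(2\beta t)$; under these formulas, setting $\sigma = 0$ removes the $1/\sqrt{t}$ term, so the bound decays like $1/t$. Function names and numerical values are hypothetical.

```python
import math

def smooth_step_size(k, L, sigma, D_X):
    """Step size eta_k = 1 / (L + sigma * sqrt(2k) / D_X); never exceeds 1/L."""
    return 1.0 / (L + sigma * math.sqrt(2.0 * k) / D_X)

def smooth_bound(t, L, sigma, D_X, D_yB, beta, rho):
    """Right-hand side of the bound for Lipschitz smooth theta_1."""
    return (math.sqrt(2.0) * D_X * sigma / math.sqrt(t)
            + L * D_X ** 2 / (2.0 * t)
            + beta * D_yB ** 2 / (2.0 * t)
            + rho ** 2 / (2.0 * beta * t))

# Noise-free case: t * bound(t) stays constant, i.e. the bound is O(1/t).
noise_free_10 = 10 * smooth_bound(10, L=2.0, sigma=0.0, D_X=1.0, D_yB=1.0, beta=1.0, rho=1.0)
noise_free_100 = 100 * smooth_bound(100, L=2.0, sigma=0.0, D_X=1.0, D_yB=1.0, beta=1.0, rho=1.0)

# Noisy case: the sqrt(t)-scaled bound approaches sqrt(2) * D_X * sigma from above.
noisy = math.sqrt(1e6) * smooth_bound(1e6, L=2.0, sigma=0.5, D_X=1.0, D_yB=1.0, beta=1.0, rho=1.0)
```

For $\sigma > 0$ the step size is strictly below $1/L$, which is the condition used when Young's inequality is applied in the proof.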



4 Summary and Future Work 



In this paper, we have proposed the stochastic setting for ADMM along with our stochastic ADMM algorithm. Based
on a first-order approximation of the stochastic function, our algorithm is applicable to a very broad class of problems,
even with functions that have no closed-form solution to the subproblem of minimizing the augmented $\theta_1$. We have
also established convergence rates under various structural assumptions on $\theta_1$: $O(1/\sqrt{t})$ for convex functions and
$O(\log t / t)$ for strongly convex functions. We are working on integrating Nesterov's optimal first-order methods [17] into
our algorithm, which will help in achieving optimal convergence rates. More interesting and challenging applications
will be carried out in our future work.



5 Appendix 

5.1 3-Points Relation 

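Before proving Lemma 1, we start with the following simple lemma, which is a very useful result when a
Bregman divergence is used as a prox-function in proximal methods.

Lemma 2. Let $l(x): \mathcal{X} \to \mathbb{R}$ be a convex differentiable function with gradient $g$. Let scalar $s \ge 0$. For any vectors $u$
and $v$, denote their Bregman divergence as $D(u, v) \equiv \omega(u) - \omega(v) - \langle \nabla\omega(v), u - v \rangle$. If, for $u \in \mathcal{X}$,

$$x^* \equiv \arg\min_{x \in \mathcal{X}} \ l(x) + sD(x, u), \tag{10}$$

then, for all $x \in \mathcal{X}$,

$$\langle g(x^*), x^* - x \rangle \le s \left[ D(x, u) - D(x, x^*) - D(x^*, u) \right].$$

Proof. Invoking the optimality condition for (10), we have

$$\langle g(x^*) + s\nabla D(x^*, u), x - x^* \rangle \ge 0, \quad \forall x \in \mathcal{X},$$

which is equivalent to

$$\begin{aligned}
\langle g(x^*), x^* - x \rangle &\le s \langle \nabla D(x^*, u), x - x^* \rangle \\
&= s \langle \nabla\omega(x^*) - \nabla\omega(u), x - x^* \rangle \\
&= s \left[ D(x, u) - D(x, x^*) - D(x^*, u) \right].
\end{aligned}$$

□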
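The last equality in the proof is the three-point identity of Bregman divergences. As a minimal numerical sketch, assuming the Euclidean choice $\omega(x) = \frac{1}{2}\|x\|^2$ (so that $D(u, v) = \frac{1}{2}\|u - v\|^2$, the case used in the proof of Lemma 1), the snippet checks the identity on random vectors. It then checks Lemma 2 on a hypothetical linear objective $l(x) = c^T x$ over $\mathcal{X} = \mathbb{R}^d$, where the minimizer is $x^* = u - c/s$ and the inequality holds with equality.

```python
import numpy as np

def bregman_sq(u, v):
    """Bregman divergence of omega(x) = 0.5*||x||^2, i.e. 0.5*||u - v||^2."""
    return 0.5 * float(np.sum((u - v) ** 2))

rng = np.random.default_rng(0)
x, u, x_star = rng.normal(size=(3, 6))

# Three-point identity: <grad omega(x*) - grad omega(u), x - x*> = D(x,u) - D(x,x*) - D(x*,u).
identity_lhs = float((x_star - u) @ (x - x_star))
identity_rhs = bregman_sq(x, u) - bregman_sq(x, x_star) - bregman_sq(x_star, u)

# Lemma 2 for l(x) = c^T x on the whole space (illustrative assumption).
c, s = rng.normal(size=6), 2.0
x_lin = u - c / s                      # argmin of c^T x + (s/2)*||x - u||^2
lemma_lhs = float(c @ (x_lin - x))     # <g(x*), x* - x> with g = c
lemma_rhs = s * (bregman_sq(x, u) - bregman_sq(x, x_lin) - bregman_sq(x_lin, u))
```

With a constraint set that is active at $x^*$, the optimality condition becomes an inequality and `lemma_lhs` would in general be strictly smaller than `lemma_rhs`.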
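5.2 Proof of Lemma 1

Proof. Due to the convexity of $\theta_1$ and using the definition of $\delta_k$, we have

$$\theta_1(x_k) - \theta_1(x) \le \langle \theta_1'(x_k), x_k - x \rangle = \langle \theta_1'(x_k, \xi_{k+1}), x_{k+1} - x \rangle + \langle \delta_{k+1}, x - x_k \rangle + \langle \theta_1'(x_k, \xi_{k+1}), x_k - x_{k+1} \rangle. \tag{11}$$

Applying Lemma 2 to Line 1 of Alg. 2 and taking $D(u, v) = \frac{1}{2}\|v - u\|^2$, we have

$$\left\langle \theta_1'(x_k, \xi_{k+1}) + A^T \left[ \beta (Ax_{k+1} + By_k - b) - \lambda_k \right], x_{k+1} - x \right\rangle
\le \frac{1}{2\eta_{k+1}} \left( \|x_k - x\|^2 - \|x_{k+1} - x\|^2 - \|x_k - x_{k+1}\|^2 \right). \tag{12}$$

Combining (11) and (12), and using $-A^T\lambda_{k+1} = A^T[\beta(Ax_{k+1} + By_{k+1} - b) - \lambda_k]$ from Line 3, we have

$$\begin{aligned}
&\theta_1(x_k) - \theta_1(x) + \langle x_{k+1} - x, -A^T\lambda_{k+1} \rangle \\
&\le \langle \theta_1'(x_k, \xi_{k+1}), x_{k+1} - x \rangle + \langle \delta_{k+1}, x - x_k \rangle + \langle \theta_1'(x_k, \xi_{k+1}), x_k - x_{k+1} \rangle \\
&\quad + \left\langle x_{k+1} - x, A^T \left[ \beta (Ax_{k+1} + By_{k+1} - b) - \lambda_k \right] \right\rangle \\
&= \left\langle \theta_1'(x_k, \xi_{k+1}) + A^T \left[ \beta (Ax_{k+1} + By_k - b) - \lambda_k \right], x_{k+1} - x \right\rangle \\
&\quad + \langle \delta_{k+1}, x - x_k \rangle + \langle x - x_{k+1}, \beta A^T B (y_k - y_{k+1}) \rangle + \langle \theta_1'(x_k, \xi_{k+1}), x_k - x_{k+1} \rangle \\
&\le \frac{1}{2\eta_{k+1}} \left( \|x_k - x\|^2 - \|x_{k+1} - x\|^2 - \|x_{k+1} - x_k\|^2 \right) + \langle \delta_{k+1}, x - x_k \rangle \\
&\quad + \langle x - x_{k+1}, \beta A^T B (y_k - y_{k+1}) \rangle + \langle \theta_1'(x_k, \xi_{k+1}), x_k - x_{k+1} \rangle.
\end{aligned} \tag{13}$$

We handle the last two terms separately:

$$\begin{aligned}
&\langle x - x_{k+1}, \beta A^T B (y_k - y_{k+1}) \rangle = \beta \langle Ax - Ax_{k+1}, By_k - By_{k+1} \rangle \\
&= \frac{\beta}{2} \left[ \left( \|Ax + By_k - b\|^2 - \|Ax + By_{k+1} - b\|^2 \right) + \left( \|Ax_{k+1} + By_{k+1} - b\|^2 - \|Ax_{k+1} + By_k - b\|^2 \right) \right] \\
&\le \frac{\beta}{2} \left( \|Ax + By_k - b\|^2 - \|Ax + By_{k+1} - b\|^2 \right) + \frac{1}{2\beta}\|\lambda_{k+1} - \lambda_k\|^2
\end{aligned} \tag{14}$$

and

$$\langle \theta_1'(x_k, \xi_{k+1}), x_k - x_{k+1} \rangle \le \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2} + \frac{\|x_k - x_{k+1}\|^2}{2\eta_{k+1}}, \tag{15}$$

where the last step is due to Young's inequality. Inserting (14) and (15) into (13), we have

$$\begin{aligned}
&\theta_1(x_k) - \theta_1(x) + \langle x_{k+1} - x, -A^T\lambda_{k+1} \rangle \\
&\le \frac{1}{2\eta_{k+1}} \left( \|x_k - x\|^2 - \|x_{k+1} - x\|^2 \right) + \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2} + \langle \delta_{k+1}, x - x_k \rangle \\
&\quad + \frac{\beta}{2} \left( \|Ax + By_k - b\|^2 - \|Ax + By_{k+1} - b\|^2 \right) + \frac{1}{2\beta}\|\lambda_{k+1} - \lambda_k\|^2.
\end{aligned} \tag{16}$$

Due to the optimality condition of Line 2 in Alg. 2 and the convexity of $\theta_2$, we have

$$\theta_2(y_{k+1}) - \theta_2(y) + \langle y_{k+1} - y, -B^T\lambda_{k+1} \rangle \le 0. \tag{17}$$

Using Line 3 in Alg. 2, we have

$$\begin{aligned}
\langle \lambda_{k+1} - \lambda, Ax_{k+1} + By_{k+1} - b \rangle
&= \frac{1}{\beta} \langle \lambda_{k+1} - \lambda, \lambda_k - \lambda_{k+1} \rangle \\
&= \frac{1}{2\beta} \left( \|\lambda - \lambda_k\|^2 - \|\lambda - \lambda_{k+1}\|^2 - \|\lambda_{k+1} - \lambda_k\|^2 \right).
\end{aligned} \tag{18}$$

Taking the summation of inequalities (16), (17) and (18), we obtain the result as desired. □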
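The cross term handled in (14) rests on a purely algebraic identity followed by dropping one nonpositive term. A minimal numerical sketch on random data (all sizes and values are illustrative assumptions) checks both the equality $\beta\langle Ax - Ax_{k+1}, By_k - By_{k+1}\rangle = \frac{\beta}{2}[\cdots]$ and the resulting upper bound, using $\lambda_{k+1} - \lambda_k = -\beta(Ax_{k+1} + By_{k+1} - b)$ from Line 3.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d1, d2, beta = 3, 4, 5, 0.7
A, B, b = rng.normal(size=(m, d1)), rng.normal(size=(m, d2)), rng.normal(size=m)
x, x_next = rng.normal(size=(2, d1))      # stand-ins for x and x_{k+1}
y_k, y_next = rng.normal(size=(2, d2))    # stand-ins for y_k and y_{k+1}

def sq(v):
    """Squared Euclidean norm."""
    return float(v @ v)

# Left-hand side of (14).
lhs = beta * float((A @ x - A @ x_next) @ (B @ y_k - B @ y_next))

# Four-term expansion in the middle line of (14).
middle = 0.5 * beta * (
    sq(A @ x + B @ y_k - b) - sq(A @ x + B @ y_next - b)
    + sq(A @ x_next + B @ y_next - b) - sq(A @ x_next + B @ y_k - b)
)

# Final bound: drop -||A x_{k+1} + B y_k - b||^2 and rewrite the remaining term via Line 3.
lam_diff = -beta * (A @ x_next + B @ y_next - b)
upper = (0.5 * beta * (sq(A @ x + B @ y_k - b) - sq(A @ x + B @ y_next - b))
         + sq(lam_diff) / (2.0 * beta))
```

The gap `upper - middle` equals $\frac{\beta}{2}\|Ax_{k+1} + By_k - b\|^2$, which is exactly the term discarded in the last line of (14).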
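5.3 Proof of Theorem 1

Proof. (i) Invoking the convexity of $\theta_1(\cdot)$ and $\theta_2(\cdot)$ and the monotonicity of the operator $F(\cdot)$, we have $\forall w \in \mathcal{W}$:

$$\begin{aligned}
\theta(\bar{u}_t) - \theta(u) + (\bar{w}_t - w)^T F(\bar{w}_t)
&\le \frac{1}{t} \sum_{k=1}^{t} \left[ \theta_1(x_{k-1}) + \theta_2(y_k) - \theta(u) + (w_k - w)^T F(w_k) \right] \\
&= \frac{1}{t} \sum_{k=0}^{t-1} \left[ \theta_1(x_k) + \theta_2(y_{k+1}) - \theta(u) + (w_{k+1} - w)^T F(w_{k+1}) \right].
\end{aligned} \tag{19}$$

Applying Lemma 1 at the optimal solution $(x, y) = (x_*, y_*)$, we can derive from (19) that, $\forall \lambda$,

$$\begin{aligned}
&\theta(\bar{u}_t) - \theta(u_*) + (\bar{x}_t - x_*)^T(-A^T\bar{\lambda}_t) + (\bar{y}_t - y_*)^T(-B^T\bar{\lambda}_t) + (\bar{\lambda}_t - \lambda)^T(A\bar{x}_t + B\bar{y}_t - b) \\
&\le \frac{1}{t} \sum_{k=0}^{t-1} \left[ \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2} + \frac{1}{2\eta_{k+1}} \left( \|x_k - x_*\|^2 - \|x_{k+1} - x_*\|^2 \right) + \langle \delta_{k+1}, x_* - x_k \rangle \right] \\
&\quad + \frac{1}{t} \left( \frac{\beta}{2}\|Ax_* + By_0 - b\|^2 + \frac{1}{2\beta}\|\lambda - \lambda_0\|^2 \right) \\
&\le \frac{1}{t} \sum_{k=0}^{t-1} \left[ \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2} + \langle \delta_{k+1}, x_* - x_k \rangle \right]
+ \frac{1}{t} \left( \frac{D_{\mathcal{X}}^2}{2\eta_t} + \frac{\beta}{2} D_{y_*,B}^2 + \frac{1}{2\beta}\|\lambda - \lambda_0\|^2 \right).
\end{aligned} \tag{20}$$

The above inequality is true for all $\lambda \in \mathbb{R}^m$, hence it also holds in the ball $\mathcal{B}_0 = \{\lambda : \|\lambda\|_2 \le \rho\}$. Combining this with the
fact that the optimal solution must also be feasible, it follows that

$$\begin{aligned}
&\max_{\lambda \in \mathcal{B}_0} \left\{ \theta(\bar{u}_t) - \theta(u_*) + (\bar{x}_t - x_*)^T(-A^T\bar{\lambda}_t) + (\bar{y}_t - y_*)^T(-B^T\bar{\lambda}_t) + (\bar{\lambda}_t - \lambda)^T(A\bar{x}_t + B\bar{y}_t - b) \right\} \\
&= \max_{\lambda \in \mathcal{B}_0} \left\{ \theta(\bar{u}_t) - \theta(u_*) + \bar{\lambda}_t^T(Ax_* + By_* - b) - \lambda^T(A\bar{x}_t + B\bar{y}_t - b) \right\} \\
&= \max_{\lambda \in \mathcal{B}_0} \left\{ \theta(\bar{u}_t) - \theta(u_*) - \lambda^T(A\bar{x}_t + B\bar{y}_t - b) \right\} \\
&= \theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2.
\end{aligned} \tag{21}$$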
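The last equality in (21) uses the fact that maximizing $-\lambda^T r$ over the ball $\|\lambda\|_2 \le \rho$ yields $\rho\|r\|_2$, attained at $\lambda = -\rho\, r/\|r\|_2$. This is what turns the multiplier term into the feasibility violation. A minimal numerical sketch, with a random vector `r` standing in for the residual $A\bar{x}_t + B\bar{y}_t - b$ (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 2.0
r = rng.normal(size=5)                       # stand-in for the averaged residual

lam_star = -rho * r / np.linalg.norm(r)      # maximizer on the boundary of the ball
best = float(-lam_star @ r)                  # equals rho * ||r||_2

# Random multipliers, rescaled into the ball, never exceed the closed-form maximum.
trials = rng.normal(size=(1000, 5))
norms = np.linalg.norm(trials, axis=1, keepdims=True)
trials = rho * trials / np.maximum(norms, rho)
values = -(trials @ r)
```

By the Cauchy-Schwarz inequality, every entry of `values` is bounded by `best`.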

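Taking an expectation over (21) and using (20), with $\lambda_0 = 0$, we have:

$$\begin{aligned}
&\mathbb{E}\left[\theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2\right] \\
&\le \mathbb{E}\left[ \frac{1}{t} \sum_{k=0}^{t-1} \left( \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2} + \langle \delta_{k+1}, x_* - x_k \rangle \right) + \frac{1}{t} \left( \frac{D_{\mathcal{X}}^2}{2\eta_t} + \frac{\beta}{2} D_{y_*,B}^2 \right) \right]
+ \mathbb{E}\left[ \max_{\lambda \in \mathcal{B}_0} \frac{1}{2\beta t}\|\lambda - \lambda_0\|^2 \right] \\
&\le \frac{1}{t} \left( \frac{M^2}{2} \sum_{k=1}^{t} \eta_k + \frac{D_{\mathcal{X}}^2}{2\eta_t} \right) + \frac{\beta D_{y_*,B}^2}{2t} + \frac{\rho^2}{2\beta t}
+ \frac{1}{t} \sum_{k=0}^{t-1} \mathbb{E}\left[ \langle \delta_{k+1}, x_* - x_k \rangle \right] \\
&\le \frac{\sqrt{2} D_{\mathcal{X}} M}{\sqrt{t}} + \frac{\beta D_{y_*,B}^2}{2t} + \frac{\rho^2}{2\beta t}.
\end{aligned}$$

In the second-to-last step, we use the fact that $x_k$ is independent of $\xi_{k+1}$, hence
$\mathbb{E}_{\xi_{k+1} | \xi_{[1:k]}} \langle \delta_{k+1}, x_* - x_k \rangle = \langle \mathbb{E}_{\xi_{k+1} | \xi_{[1:k]}} \delta_{k+1}, x_* - x_k \rangle = 0$.
The last step follows by plugging in $\eta_k = D_{\mathcal{X}}/(M\sqrt{2k})$ and using $\sum_{k=1}^{t} 1/\sqrt{k} \le 2\sqrt{t}$.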
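(ii) From the steps in the proof of part (i), it follows that

$$\begin{aligned}
&\theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2 \\
&\le \frac{1}{t} \sum_{k=0}^{t-1} \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2}
+ \frac{1}{t} \sum_{k=0}^{t-1} \langle \delta_{k+1}, x_* - x_k \rangle
+ \frac{1}{t} \left( \frac{D_{\mathcal{X}}^2}{2\eta_t} + \frac{\beta}{2} D_{y_*,B}^2 + \frac{\rho^2}{2\beta} \right) \\
&\equiv A_t + B_t + C_t.
\end{aligned} \tag{22}$$

Note that the random variables $A_t$ and $B_t$ are dependent on $\xi_{[t]}$.

Claim 1. For $\Omega_1 > 0$,

$$\mathrm{Prob}\left( A_t \ge (1 + \Omega_1) \frac{M^2}{2t} \sum_{k=1}^{t} \eta_k \right) \le \exp\{-\Omega_1\}. \tag{23}$$

Let $\alpha_k \equiv \frac{\eta_k}{\sum_{k=1}^{t} \eta_k}$, $\forall k = 1, \ldots, t$; then $0 \le \alpha_k \le 1$ and $\sum_{k=1}^{t} \alpha_k = 1$. Using the fact that $\{\delta_k, \forall k\}$ are independent
and applying Assumption 2, one has

$$\begin{aligned}
\mathbb{E}\left[ \exp\left\{ \sum_{k=1}^{t} \alpha_k \|\theta_1'(x_{k-1}, \xi_k)\|^2 / M^2 \right\} \right]
&= \prod_{k=1}^{t} \mathbb{E}\left[ \exp\left\{ \alpha_k \|\theta_1'(x_{k-1}, \xi_k)\|^2 / M^2 \right\} \right] \\
&\le \prod_{k=1}^{t} \left( \mathbb{E}\left[ \exp\left\{ \|\theta_1'(x_{k-1}, \xi_k)\|^2 / M^2 \right\} \right] \right)^{\alpha_k} \quad \text{(Jensen's inequality)} \\
&\le \prod_{k=1}^{t} \left( \exp\{1\} \right)^{\alpha_k} = \exp\left\{ \sum_{k=1}^{t} \alpha_k \right\} = \exp\{1\}.
\end{aligned}$$

Hence, by Markov's inequality, we get

$$\mathrm{Prob}\left( A_t \ge (1 + \Omega_1) \frac{M^2}{2t} \sum_{k=1}^{t} \eta_k \right)
\le \exp\{-(1 + \Omega_1)\}\, \mathbb{E}\left[ \exp\left\{ \sum_{k=1}^{t} \alpha_k \|\theta_1'(x_{k-1}, \xi_k)\|^2 / M^2 \right\} \right]
\le \exp\{-\Omega_1\}.$$

We have therefore proved Claim 1.

Claim 2. For $\Omega_2 > 0$,

$$\mathrm{Prob}\left( B_t > 2\Omega_2 \frac{D_{\mathcal{X}} M}{\sqrt{t}} \right) \le \exp\left\{ -\frac{\Omega_2^2}{4} \right\}. \tag{24}$$

In order to prove this claim, we adopt the following facts from Nemirovski's paper [16].

Lemma 3. Given that for all $k = 1, \ldots, t$, $\zeta_k$ is a deterministic function of $\xi_{[k]}$ with $\mathbb{E}[\zeta_k | \xi_{[k-1]}] = 0$ and
$\mathbb{E}[\exp\{\zeta_k^2/\sigma_k^2\} | \xi_{[k-1]}] \le \exp\{1\}$, we have

(a) For $\gamma \ge 0$, $\mathbb{E}[\exp\{\gamma\zeta_k\} | \xi_{[k-1]}] \le \exp\{\gamma^2\sigma_k^2\}$, $\forall k = 1, \ldots, t$.

(b) Let $S_t = \sum_{k=1}^{t} \zeta_k$; then $\mathrm{Prob}\left\{ S_t > \Omega \sqrt{\sum_{k=1}^{t} \sigma_k^2} \right\} \le \exp\left\{ -\frac{\Omega^2}{4} \right\}$.

Using this result by setting $\zeta_k = \langle \delta_k, x_* - x_{k-1} \rangle$, $S_t = \sum_{k=1}^{t} \zeta_k$, and $\sigma_k = 2D_{\mathcal{X}}M$, $\forall k$, we can verify that
$\mathbb{E}[\zeta_k | \xi_{[k-1]}] = 0$ and

$$\mathbb{E}\left[ \exp\{\zeta_k^2/\sigma_k^2\} \,|\, \xi_{[k-1]} \right] \le \mathbb{E}\left[ \exp\{D_{\mathcal{X}}^2\|\delta_k\|^2/\sigma_k^2\} \,|\, \xi_{[k-1]} \right] \le \exp\{1\},$$

since $|\zeta_k|^2 \le \|x_* - x_{k-1}\|^2 \|\delta_k\|^2 \le D_{\mathcal{X}}^2 \left( 2\|\theta_1'(x_{k-1}, \xi_k)\|^2 + 2M^2 \right)$.

Applying the above results, it follows that

$$\mathrm{Prob}\left( S_t > 2\Omega_2 D_{\mathcal{X}} M \sqrt{t} \right) \le \exp\left\{ -\frac{\Omega_2^2}{4} \right\}.$$

Since $S_t = tB_t$, we have

$$\mathrm{Prob}\left( B_t > 2\Omega_2 \frac{D_{\mathcal{X}} M}{\sqrt{t}} \right) \le \exp\left\{ -\frac{\Omega_2^2}{4} \right\}$$

as desired.

Combining (22), (23) and (24), we obtain

$$\mathrm{Prob}\left( \mathrm{Err}_\rho(\bar{u}_t) > (1 + \Omega_1) \frac{M^2}{2t} \sum_{k=1}^{t} \eta_k + 2\Omega_2 \frac{D_{\mathcal{X}} M}{\sqrt{t}} + C_t \right) \le \exp\{-\Omega_1\} + \exp\left\{ -\frac{\Omega_2^2}{4} \right\},$$

where $\mathrm{Err}_\rho(\bar{u}_t) \equiv \theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2$. Substituting $\Omega_1 = \Omega$, $\Omega_2 = 2\sqrt{\Omega}$, and plugging in $\eta_k = \frac{D_{\mathcal{X}}}{M\sqrt{2k}}$,
we obtain (7) as desired. □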

5.4 Proof of Theorem 2 

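Proof. By the strong convexity of $\theta_1$ we have $\forall x$:

$$\begin{aligned}
\theta_1(x_k) - \theta_1(x) &\le \langle \theta_1'(x_k), x_k - x \rangle - \frac{\mu}{2}\|x - x_k\|^2 \\
&= \langle \theta_1'(x_k, \xi_{k+1}), x_{k+1} - x \rangle + \langle \delta_{k+1}, x - x_k \rangle + \langle \theta_1'(x_k, \xi_{k+1}), x_k - x_{k+1} \rangle - \frac{\mu}{2}\|x - x_k\|^2.
\end{aligned}$$

Following the same derivations as in Lemma 1 and Theorem 1 (i), we have

$$\begin{aligned}
&\mathbb{E}\left[\theta(\bar{u}_t) - \theta(u_*) + \rho\|A\bar{x}_t + B\bar{y}_t - b\|_2\right] \\
&\le \frac{1}{t}\, \mathbb{E} \sum_{k=0}^{t-1} \left\{ \frac{\eta_{k+1}\|\theta_1'(x_k, \xi_{k+1})\|^2}{2} + \left( \frac{1}{2\eta_{k+1}} - \frac{\mu}{2} \right) \|x_k - x_*\|^2 - \frac{\|x_{k+1} - x_*\|^2}{2\eta_{k+1}} \right\}
+ \frac{\beta D_{y_*,B}^2}{2t} + \mathbb{E}\left[ \max_{\lambda \in \mathcal{B}_0} \frac{1}{2\beta t}\|\lambda - \lambda_0\|^2 \right] \\
&\le \frac{M^2}{2t} \sum_{k=1}^{t} \frac{1}{\mu k} + \frac{1}{t} \sum_{k=0}^{t-1} \mathbb{E}\left\{ \frac{\mu k}{2}\|x_k - x_*\|^2 - \frac{\mu (k+1)}{2}\|x_{k+1} - x_*\|^2 \right\}
+ \frac{\beta D_{y_*,B}^2}{2t} + \frac{\rho^2}{2\beta t} \\
&\le \frac{M^2 \log t}{\mu t} + \frac{\mu D_{\mathcal{X}}^2}{2t} + \frac{\beta D_{y_*,B}^2}{2t} + \frac{\rho^2}{2\beta t}.
\end{aligned}$$

□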



5.5 Proof of Theorem 3 

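Proof. The Lipschitz smoothness of $\theta_1$ implies that $\forall k \ge 0$:

$$\begin{aligned}
\theta_1(x_{k+1}) &\le \theta_1(x_k) + \langle \nabla\theta_1(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\
&= \theta_1(x_k) + \langle \nabla\theta_1(x_k, \xi_{k+1}), x_{k+1} - x_k \rangle - \langle \delta_{k+1}, x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2,
\end{aligned}$$

where the equality uses the definition of $\delta_{k+1}$ in (3). It follows that $\forall x \in \mathcal{X}$:

$$\begin{aligned}
&\theta_1(x_{k+1}) - \theta_1(x) + \langle x_{k+1} - x, -A^T\lambda_{k+1} \rangle \\
&\le \theta_1(x_k) - \theta_1(x) + \langle \nabla\theta_1(x_k, \xi_{k+1}), x_{k+1} - x_k \rangle - \langle \delta_{k+1}, x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 + \langle x_{k+1} - x, -A^T\lambda_{k+1} \rangle \\
&= \theta_1(x_k) - \theta_1(x) + \langle \nabla\theta_1(x_k, \xi_{k+1}), x - x_k \rangle - \langle \delta_{k+1}, x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\
&\quad + \left[ \langle \nabla\theta_1(x_k, \xi_{k+1}), x_{k+1} - x \rangle + \langle x_{k+1} - x, -A^T\lambda_{k+1} \rangle \right] \\
&\le \langle \nabla\theta_1(x_k), x_k - x \rangle + \langle \nabla\theta_1(x_k, \xi_{k+1}), x - x_k \rangle - \langle \delta_{k+1}, x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\
&\quad + \left[ \langle \nabla\theta_1(x_k, \xi_{k+1}), x_{k+1} - x \rangle + \langle x_{k+1} - x, -A^T\lambda_{k+1} \rangle \right] \\
&= \langle \delta_{k+1}, x - x_{k+1} \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 + \left[ \langle \nabla\theta_1(x_k, \xi_{k+1}), x_{k+1} - x \rangle + \langle x_{k+1} - x, -A^T\lambda_{k+1} \rangle \right] \\
&= \langle \delta_{k+1}, x - x_{k+1} \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 + \langle x - x_{k+1}, \beta A^T B (y_k - y_{k+1}) \rangle \\
&\quad + \left\langle \nabla\theta_1(x_k, \xi_{k+1}) + A^T \left[ \beta (Ax_{k+1} + By_k - b) - \lambda_k \right], x_{k+1} - x \right\rangle \\
&\le \frac{1}{2\eta_{k+1}} \left( \|x - x_k\|^2 - \|x - x_{k+1}\|^2 \right) - \frac{1/\eta_{k+1} - L}{2}\|x_{k+1} - x_k\|^2 \\
&\quad + \langle x - x_{k+1}, \beta A^T B (y_k - y_{k+1}) \rangle + \langle \delta_{k+1}, x - x_{k+1} \rangle.
\end{aligned}$$

The last inner product can be bounded as below using Young's inequality, given that $\eta_{k+1} < \frac{1}{L}$:

$$\begin{aligned}
\langle \delta_{k+1}, x - x_{k+1} \rangle &= \langle \delta_{k+1}, x - x_k \rangle + \langle \delta_{k+1}, x_k - x_{k+1} \rangle \\
&\le \langle \delta_{k+1}, x - x_k \rangle + \frac{1}{2(1/\eta_{k+1} - L)}\|\delta_{k+1}\|^2 + \frac{1/\eta_{k+1} - L}{2}\|x_k - x_{k+1}\|^2.
\end{aligned}$$

Combining this with inequalities (14), (17) and (18), we get a statement similar to that of Lemma 1:

$$\begin{aligned}
\theta(u_{k+1}) - \theta(u) + (w_{k+1} - w)^T F(w_{k+1})
\le{}& \frac{\|\delta_{k+1}\|^2}{2(1/\eta_{k+1} - L)}
+ \frac{1}{2\eta_{k+1}} \left( \|x_k - x\|^2 - \|x_{k+1} - x\|^2 \right) \\
&+ \frac{\beta}{2} \left( \|Ax + By_k - b\|^2 - \|Ax + By_{k+1} - b\|^2 \right) \\
&+ \langle \delta_{k+1}, x - x_k \rangle + \frac{1}{2\beta} \left( \|\lambda - \lambda_k\|^2 - \|\lambda - \lambda_{k+1}\|^2 \right).
\end{aligned}$$

The rest of the proof is essentially the same as that of Theorem 1 (i), except that we use the new definition of $\bar{u}_k$ in (9). □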